跳至主要内容

Feature Engineering

Feature engineering means converting raw data into useful input columns for a model.

Raw events are usually too detailed for basic model training.

For example:

user_001 viewed item SKU_1001 at 2023-01-01 10:00
user_001 added item SKU_1001 to cart at 2023-01-01 10:01
user_001 purchased at 2023-02-03 20:00

The model does not need every event as a separate row.

For our example, we summarize events into one row per user.

Feature Examples

FeatureMeaning
recencyHow many days since the user's last event
frequencyHow many sessions the user had
view_item_countHow many product views the user had
add_to_cart_countHow many times the user added items to cart
avg_price_viewedAverage price of viewed products

These features summarize user behavior.

Label Example

The label is:

will_purchase

The value is:

  • 1: the user has a purchase event
  • 0: the user does not have a purchase event

Feature Table Shape

The final training table should look like this:

user_idrecencyfrequencyview_item_countadd_to_cart_countavg_price_viewedlabel
user_0011512342.51
user_0021411020.00

Each row has:

  • One user id
  • Several feature columns
  • One label column

Build Features With pandas

Assume we already created these DataFrames in the data preparation step:

feature_events

This DataFrame comes from the flattened GA4 mock data.

For feature columns, remove purchase rows first.

The purchase event will be used as the label.

behavior_events = feature_events[
feature_events["event_name"] != "purchase"
].copy()

Before grouping, create helper columns.

behavior_events = behavior_events.copy()

behavior_events["is_view_item"] = (
behavior_events["event_name"] == "view_item"
).astype(int)

behavior_events["is_add_to_cart"] = (
behavior_events["event_name"] == "add_to_cart"
).astype(int)

behavior_events["view_item_price"] = behavior_events["price"].where(
behavior_events["event_name"] == "view_item"
)

Why create these columns?

  • is_view_item makes product views easy to count
  • is_add_to_cart makes cart events easy to count
  • view_item_price makes sure we only average prices from product view events

Create user-level behavior features.

import pandas as pd

reference_date = feature_events["event_date"].max()

features = (
behavior_events
.groupby("user_id")
.agg(
last_event_date=("event_date", "max"),
frequency=("session_id", "nunique"),
view_item_count=("is_view_item", "sum"),
add_to_cart_count=("is_add_to_cart", "sum"),
avg_price_viewed=("view_item_price", "mean"),
)
.reset_index()
)

Create a recency feature.

features["recency"] = (
reference_date - features["last_event_date"]
).dt.days

Remove helper columns that should not be used by the model.

features = features.drop(columns=["last_event_date"])

Build Labels With pandas

Find users who have a purchase event.

purchase_users = (
feature_events
.loc[feature_events["event_name"] == "purchase", "user_id"]
.drop_duplicates()
)

Create the label column.

labels = pd.DataFrame({
"user_id": purchase_users,
"label": 1,
})

Join features and labels.

training_data = features.merge(labels, on="user_id", how="left")
training_data["label"] = training_data["label"].fillna(0).astype(int)

Handle Missing Feature Values

Some users may not have viewed any product, so avg_price_viewed may be missing.

For a beginner example, fill it with 0.

training_data["avg_price_viewed"] = training_data["avg_price_viewed"].fillna(0)

Later, we can use scikit-learn's SimpleImputer to handle missing values inside a model pipeline.

Preview The Training Table

training_data.head()

Check the label distribution.

training_data["label"].value_counts()

If almost every row has the same label, the model may have difficulty learning useful patterns.

Select Feature Columns

The model should use behavior columns, not the user id.

feature_cols = [
"recency",
"frequency",
"view_item_count",
"add_to_cart_count",
"avg_price_viewed",
]

X = training_data[feature_cols]
y = training_data["label"]

Feature Engineering Rules

Good beginner rules:

  • Use features that are available before prediction time
  • Keep one row meaning consistent
  • Avoid using ID columns as model features
  • Create simple count, average, max, min, and date-difference features first
  • Check null values before training
  • Keep feature names easy to understand

Feature engineering is usually the most important part of a beginner machine learning project.